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ABSTRACT 

We discuss a programming language for real-time audio signal 
processing that is embedded in the functional language Haskell 
and uses the Low-Level Virtual Machine as back-end. With that 
framework we can code with the comfort and type safety of 
Haskell while achieving maximum efficiency of fast inner loops 
and full vectorisation. This way Haskell becomes a valuable alter- 
native to special purpose signal processing languages. 

1. INTRODUCTION 

Given a data flow diagram as in Figure [T] we want to generate an 
executable machine program as in Figure|4] First the diagram must 
be translated to something that is more accessible by a machine. 
Since we can translate data flows almost literally to function ex- 
pressions we choose a functional programming language as the 
target language, here Haskell |1|. The result can be seen in 
Figure [2] This translation must be done manually but in future it 
could also be supported by a flow diagram editor like PureData (2) 
or Gems |3|. The second step is to translate the function expres- 
sion to a machine oriented presentation. This is the main concern 
of our paper. 

Since we represent signals as sequences of numbers, signal 
processing algorithms are usually loops that process these num- 
bers one after another. Thus our goal is to generate efficient loop 
bodies as in Figure[3]from a functional signal processing represen- 
tation. We have chosen the Low-Level Virtual-Machine (LLVM) 
Pfl for the loop description, because ILLVMI provides a universal 
representation for machine languages of a wide range of proces- 
sors. The ILLVMl library is responsible for the third step, namely 
the translation of portable virtual machine code to actual machine 
code of the host processor. 
Our contributions are 

• a representation of an lLLVMl loop body that can be treated 
like a signal, described in Section lTTl 

• a way to describe causal signal processes which is the dom- 
inant kind of signal transformations in real-time audio pro- 
cessing and which allows us to cope efficiently with multi- 
ple uses of outputs and with feedback of even small delays, 
guaranteed deadlock free, developed in Section [3~2l 

• a handling of internal filter parameters in a way that is 
much more flexible than traditional control rate/sample rate 
schemes, presented in Section [3~3l 

• support for the vector units of modern processors both for 
non-recursive and recursive signal processes as derived in 
Sectionl3n 

• a method for compiling a signal processing algorithm once 
and run it with different parameters as shown in Section [331 

• and a speed comparison with well established signal pro- 
cessing packages in Section|5] 
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Figure 1 : Dataflow for creation of a very simple percussive sound 



amplify 

(exponential halfLife amp) 
(osci Wave . saw phase freq) 

Figure 2: Functional expression for the diagram in Figure\l\ 



define generate 

(phase, freq, amp, decay, ptr, counter) { 
goto enter 



loop : 
yO 

yi 
y2 



mul 2, phase ; phase \in [0,1] 

sub 1, yO ; saw = l-2*phase 

mul amp, yl ; out = amp * saw 

store y2, ptr ; write to buffer 

amp := mul amp, decay 

; update amp 
phaseTmp := add phase, freq 

; update phase 
phase := fraction phaseTmp 

; wrap phase to [0,1] 

counter := sub counter, 1 
ptr := add ptr, sizeof (y2) 

enter : 

if counter>0 then goto loop 

} 

Figure 3: Simplified LLVM assembly code to be generated from 
the function expression in Figure [2] In our example scalar and 
vectorised loop are the same. 
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# (phase, phase+freq, phase+2*f req, ...) 



movaps 


phases 


, %xmmO 






# (amp, amp/2 A (- 


1/halfLife) , . . . ) 


movaps 


amps , 


%xmml 






# (4*freq, 


4*freq, 4*freq, 


. . . ) 


movaps 


f reqs , 


%xmm2 






# (2" (-4/halfLife) , 2 A (- 


-4/halfLife), . 


movaps 


decays 


, %xmm3 






movaps 


ones , 


%xmm4 


# 


(1,1,1,1) 


jmp 


enter 








loop : 










movaps 


%xmm3 , 


%xmm5 






mulps 


%xmml , 


%xmm5 


# 


amp* dec ays 


movaps 


%xmm2 , 


%xmm6 






addps 


%xmmO , 


%xmm6 


# 


t' := t+freq 


mulps 


twos , 


%xmmO 


# 


2*t 


movaps 


%xmm4 , 


%xmm7 






subps 


%xmmO , 


%xmm7 


# 


l-2*t 


mulps 


%xmml , 


%xmm7 


# 


amp * ( 1 - 2 * t ) 


movaps 


%xmm7 , 


( %ecx) 


# 


store 


addl 


$16, % 


ecx 


# 


next packet 


cvttps2dq 


%xmm6 , 


%xmmO 






cvtdq2ps 


%xmmO , 


%xmmO 


# 


truncate t' 


subps 


%xmmO , 


%xmm6 


# 


fraction t' 


movaps 


%xmm5 , 


%xmml 






movaps 


%xmm6 , 


%xmmO 






decl 


%edx 




# 


counter 


enter : 










testl 


%edx, 


%edx 






jne 


loop 









Figure 4: Intel x86 assembly loop for Figure\3\using SSE vector 
instructions. 



2. BACKGROUND 

We want to generate ILLVMI code from a signal processing algo- 
rithm written in a declarative way. We like to write code close 
to a data flow diagram and the functional paradigm seems to be 
appropriate. 

We could design a new language specifically for this purpose, 
but we risk the introduction of design flaws. We could use an ex- 
isting signal processing language, but we are afraid that it does 
not scale well to applications other than signal processing. Alter- 
natively we can resort to an existing general purpose functional 
programming language or a subset of it, and write a compiler with 
optimisations adapted to signal processing needs. But writing a 
compiler for any modern "real-world" programming language is a 
task of several years, if not decades. A compiler for a subset of 
an existing language however would make it hard to interact with 
existing libraries. So we can still tune an existing compiler for an 
existing language, but given the complexity of modern languages 
and their respective compilers this is still a big effort. It might turn 
out that a change that is useful for signal processing kills perfor- 
mance for another application. 

A much quicker way to adapt a language to a special purpose is 
the Embedded Domain Specific Language (EDSL) approach (5). 
In this terminology "embedded" means that the domain specific 
(or "special purpose") language is actually not an entirely new 
language, but a way to express domain specific issues using corre- 
sponding constructs and checks of the host language. For example, 



writing an SQL command as string literal in Java and sending it to 
a database, is not an lEDSLl In contrast to that, Hibernate (6) is 
an lEDSLl because it makes database table rows look like ordinary 
Java objects and it makes the use of foreign keys safe and comfort- 
able by making foreign references look like Java references. 

In the same way we want to cope with signal processing in 
Haskell. In the expression 

amplify 

(exponential halfLife amp) 
(osci Wave . saw phase freq) 

the call to osci shall not produce a signal, but instead it shall gen- 
erate [O^Ml code that becomes part of a signal generation loop 
later. In the same way amplify assembles the code parts pro- 
duced by exponential and osci and defines the product of 
their results as its own result. In the end every such signal expres- 
sion is actually a high-level ILLVMI macro and finally, we pass it 
to a driver function that compiles and runs the code. Where Hi- 
bernate converts Java expressions to SQL queries, sends them to a 
database and then converts the database answers back to Java ob- 
jects, we convert Haskell expressions to LLVM bitcode, send 
it to the LLVM Just-In-Time (JIT) compiler and then execute the 
resulting code. We can freely exchange signal data between pure 
Haskell code and LLVM generated code. 

The lEDSLl approach is very popular among Haskell pro- 
grammers. For instance interfaces to the CSound signal processing 
language |7| and the real-time software synthesiser SuperCollider 
(81 are written this way. This popularity can certainly be attributed 
to the concise style of writing Haskell expressions and to the 
ease of overloading number literals and arithmetic operators. We 
shall note that the lEDSLl method has its own shortcomings, most 
notably the sharing problem that we tackle in Section [T2l 

In (9) we have argued extensively, why we think that 
Haskell is a good choice for signal processing. Summarised, 
the key features for us are polymorphic but strong static typing 
and lazy evaluation 1101 . Strong typing means that we have a wide 
range of types that the compiler can distinguish between. This way 
we can represent a trigger or gate signal by a sequence of boolean 
values (type Bool) and this cannot be accidentally mixed up with 
a PCM signal (sample type Int8), although both types may be 
represented by bytes internally. We can also represent internal pa- 
rameters of signal processes by opaque types that can be stored by 
the user but cannot be manipulated (cf. Section [33l >. Polymorphic 
typing means that we can write a generic algorithm that can be 
applied to single precision or double precision floating point num- 
bers, to fixed point numbers or complex numbers. Static typing 
means that the Haskell compiler can check that everything fits 
together when compiling a program or parts of it. Lazy evaluation 
means that we can transform audio data as it becomes available 
while programming in a style that treats those streams as if they 
were available at once. 

The target language of our embedded compiler is ILLVMI It 
differs from CSound and SuperCollider in that ILLVMI is not a 
signal processing language. It is a high-level assembler and we 
have to write the core signal processing building blocks ourselves. 
However, once this is done assembling those blocks is as simple 
as writing CSound or SuperCollider/SCLang programs. We could 
have chosen a concrete machine language as target, but ILLVMI 
does a much better job for us: It generates machine code for many 
different processors, thus it can be considered a portable assem- 
bler. It also supports the vector units of modern processors (Sec- 
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tion 13.4b and target dependent instructions (intrinsics) and pro- 
vides us with a large set of low-level to high-level optimisations, 
that we can even select and arrange individually. We can run 
ILLVMI code immediately from our Haskell programs jJITb . but 
we can also write ILLVMI bitcode files for debugging or external 
usage. 

3. IMPLEMENTATION 

We are now going to discuss the design of our implementation 

QUCES). 

3.1. Signal generator 

In our design a signal is a sequence of sample values and a signal 
generator is a state transition system that ships a single sample per 
request while updating the state. E.g. the state of an exponential 
curve is the current amplitude and on demand it returns the cur- 
rent amplitude as result while decreasing the amplitude state by a 
constant factor. In the same way an oscillator uses the phase as 
internal state. Per request it applies a wave function on the phase 
and delivers the resulting value as current sample. Additionally it 
increases the phase by the oscillator frequency and wraps around 
the result to the interval [0, 1). This design is much inspired by 

According to this model we define an lLLViVH signal generator 
in Haskell essentially as a pair of an initial state and a function 
that returns a tuple containing a flag showing whether there are 
more samples to come, the generated sample and the updated state. 

type Generator a = 
(state, 

state -> Code (Value Bool, (a, state) ) ) 

The lower-case identifiers are type variables that can be instan- 
tiated with actual types. The variable a is for the sample type 
and state for the internal state of the signal generator. Since 
Generator is not really a signal but a description for ILLVM1 
code, the sample type cannot be just a Haskell number type 
like Float or Double. Instead it must be the type for one 
of ILLVMl s virtual registers, namely Value Float or Value 
Double, respectively. The types Value and Code are imported 
from a Haskell interface to lLLVMlfT4l . 

The type parameter is not restricted in any way, thus we can 
implement a generator of type Generator (Value Float, 
Value Float) for a stereo signal generator or Generator 
(Value Bool, Value Float) for a gate signal and a con- 
tinuous signal that are generated synchronously. We do not worry 
about a layout in memory of an according signal at this point, since 
it may be just an interim signal that is never written to memory. 
E.g. the latter of the two types just says, that the generated sam- 
ples for every call to the generator can be found in two virtual 
registers, where one register holds a boolean and the other one a 
floating point number. 

We like to complement this general description with the simple 
example of an exponential curve generator. 

exponential : : 

Float -> Float -> Generator (Value Float) 
exponential halfLife amp = 

(valueOf amp, 

\y0 -> do 



yl <- mul yO (valueOf 

(2** (-1/halfLife) :: Float)) 
return (valueOf True, (yO, yl))) 

For simplification we use the fixed type Float but in the real 
implementation the type is flexible. The implementation is the 
same, only the real type of exponential is considerably more 
complicated because of many constraints to the type parameters. 

The function valueOf makes a Haskell value available as 
constant in lLLVMl code. Thus the power computation with ** in 
the mul instruction is done by Haskell and then implanted into 
the lLLVMl code. This also implies that the power is computed only 
once. The whole transition function, that is the second element of 
the pair, is a lambda expression, also known as anonymous func- 
tion. It starts with a backslash and its argument y which identifies 
the virtual register that holds the current internal state. It returns 
always True because it never terminates and it returns the current 
amplitude yO as current sample and the updated amplitude com- 
puted by a multiplication to be found in the register identified by 
yl. 

We have seen how basic signal generators work, however, sig- 
nal processing consists largely of transforming signals. In our 
framework a signal transformation is actually a generator transfor- 
mation. That is we take apart given generators and build something 
new from them. For example the controlled amplifier dissects the 
envelope generator and the input generator and assembles a gener- 
ator for the amplified signal. 

amplify : : 

Generator (Value Float) -> 
Generator (Value Float) -> 
Generator (Value Float) 
amplify (envlnit, envTrans) 
(inlnit, inTrans) = 
((envlnit, inlnit), 
(\ (eO, iO) -> do 

(eCont, (ev,el)) <- envTrans eO 
(iCont, (iv, il) ) <- inTrans iO 
y <- mul ev iv 
oont <- and eCont iCont 
return (cont, (y, (el,il))))) 

So far our signals only exist as ILLVMI code, but computing 
actual data is straightforward: 

render : : 

Generator (Value Float) -> 
Value Word32 -> Value (Ptr Float) -> 
Code (Value Word32) 
render (start, next) size ptr = do 

(pos,_) <- arrayLoop size ptr start $ 
\ ptri sO -> do 

(cont, (y,sl)) <- next sO 
ifThen cont () (store y ptri) 
return (cont, si) 
ret pos 

The ugly branching that is typical for assembly languages includ- 
ing that of lLLVMl is hidden in our custom functions arrayLoop 
and ifThen. Haskell makes a nice job as macro assembler. 
Again, we only present the most simple case here. The alternative 
to filling a single buffer with signal data is to fill a sequence of 
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chunks that are created on demand. This is called lazy evaluation 
and one of the central features of Haskell. 

At this point, we might wonder whether the presented model 
of signal generators is general enough to match all kinds of signals 
that can appear in real applications. The answer is yes, since given 
a signal there is a generator that emits that signal. We simply write 
the signal to a buffer and then use a signal generator that manages 
a pointer into this buffer as internal state. This generator has a 
real-world use when reading a signal from a file. We see that our 
model of signal generators does not impose a restriction on the 
kind of signals, but it well restricts the access to the generated 
data: We can only traverse from the beginning to the end of the 
signal without skipping any value. This is however intended since 
we want to play the signals in real-time. 

3.2. Causal Processes 

While the above approach of treating signal transformations as sig- 
nal generator transformations is very general it can be inefficient. 
For example for a signal generator x the expression mix x x 
does not mean that the signal represented by x is computed once 
and then mixed with itself. Instead, the mixer runs the signal gen- 
erator x twice and adds the results of both instances. I like to 
call that the sharing problem. It is inherent to all DSLs that are 
embedded into a purely functional language, since in those lan- 
guages objects have no identity, i.e. you cannot obtain an object's 
address in memory. The sharing problem also occurs if we process 
the components of a multi-output signal process individually, for 
instance the channels of a stereo signal or the lowpass, bandpass, 
highpass components of a state variable filter. E.g. for delaying the 
right channel of a stereo signal we have to write stereo (left 
x) (delay (right x) ) and we run into the sharing prob- 
lem, again. 

We see two ways out: The first one is relying on lLLViVll s op- 
timiser to remove the duplicate code. However this may fail since 
ILLVMI cannot remove duplicate code if it relies on seemingly inde- 
pendent states, on interaction with memory or even on interaction 
with the outside world. Another drawback is that the temporar- 
ily generated code may grow exponentially compared to the code 
written by the user. E.g. in 

let y = mix x x 
z = mix y y 
in mix z z 

the generator x is run eight times. 

The second way out is to store the results of a generator and 
share the storage amongst all users of the generator. We can do 
this by rendering the signal to a lazy list, or preferably to a lazily 
generated list of chunks for higher performance. This approach is 
a solution to the general case and it would also work if there are 
signal processes involved that shrink the time line, like in mix x 

(timeShrink x) . 

While this works in the general case there are many cases 
where it is not satisfying. Especially in the example mix x x 
we do not really need to store the result of x anywhere as it is 
consumed immediately by the mixer. Storing the result is at least 
inefficient in case of a plain Haskell singly linked list and even 
introduces higher latency in case of a chunk list. 

So what is the key difference between mix x x and mix x 

(timeShrink x) ? It is certainly that in the first case data is 
processed in a synchronous way thus it can be consumed (mixed) 



as it is produced (generated by x). However, the approach of signal 
transformation by signal generator transformation cannot model 
this behaviour. When considering the expression mix x (fx) 
we have no idea whether f maintains the "speed" of its argument 
generator. That is, we need a way to express that f emits data 
synchronously to its input. For instance we could define 

type Map a b = a -> Code b 

that represents a signal transformation of type Generator a 
-> Generator b. It could be applied to a signal generator by 
a function apply with type 

Map a b -> Generator a -> Generator b 

and where we would have written f x before, we would write 
apply f x instead. 

It turns out that Map is too restrictive. Our signal process 
would stay synchronous if we allow a running state as in a recur- 
sive filter and if we allow termination of the signal process before 
the end of the input signal as in the Haskell list function take. 
Thus, what we actually use is a definition that boils down to 

type Causal a b = 

(state, (a, state) -> 

Code (Value Bool, (b, state))) 

With this type we can model all kinds of causal processes, that is 
processes where every output sample depends exclusively on the 
current and past input samples. The take function may serve as 
an example for a causal process with termination. 

take : : Int -> Causal a a 
take n = 

(valueOf n, 
\ (a, toDo) -> do 

cont <- icmp IntULT (valueOf 0) toDo 
stillToDo <- sub toDo (valueOf 1) 
return (cont, (a, StillToDo) ) ) 

The function apply for applying a causal process to a signal 
generator has the signature 

apply : : 

Causal a b -> Generator a -> Generator b 

and its implementation is straightforward. The function is neces- 
sary to do something useful with causal processes, but it loses the 
causality property. For sharing we want to make use of facts like 
that the serial composition of causal processes is causal, too, but 
if we have to express the serial composition of processes f and g 
by apply f (apply g x ), then we cannot make use of such 
laws. The solution is to combine processes with processes rather 
than transformations with signals. E.g. with >>> denoting the se- 
rial composition we can state that g >>> f is a causal process. 

In the base Haskell libraries there is already the Arrow ab- 
straction that was developed for the design of integrated circuits in 
the Lava project, but it proved to be useful for many other appli- 
cations. The Arrow type class provides a generalisation of plain 
Haskell functions. For making Causal an instance of Arrow 
we must provide the following minimal set of methods and warrant 
the validity of the arrow laws 1 15 1. 



DAFX-4 



Proc. of the 13" Int. Conference on Digital Audio Effects (DAFx-10), Graz, Austria , September 6-10, 2010 



arr : : (a -> b) -> Causal a b 
(»>) : : 

Causal a b -> Causal b c -> Causal a c 
first : : Causal a b -> Causal (a,c) (b,c) 

The infix operator >>> implements (serial) function composition, 
the function first allows for parallel composition, and the func- 
tion arr generates stateless transformations including rearrange- 
ment of tuples as needed by first. It turns out that all of these 
combinatory maintain causality. They allow us to express all kinds 
of causal processes without feedback. If f and mix are causal 
processes, then we can translate the former mix x (f x) to 

arr (\x -> (x,x) ) >>> second f >>> mix 
where second p = arr swap >>> p >>> arr swap 
swap (a, b) = (b, a) 

For implementation of feedback we need only one other com- 
binator, namely loop. 

loop : : c -> Causal (a, c) (b,c) -> Causal a b 

The function loop feeds the output of type c of a process back 
to its input channel of the same type. In contrast to the loop 
method of the standard Ar r owLoop class we must delay the value 
by one sample and thus need an initial value of type c for the 
feedback signal. Because of the way loop is designed it cannot 
run into deadlocks. In general deadlocks can occur whenever a 
signal processor runs ahead of time, that is it requires future input 
data in order to compute current output data. Our notion of a causal 
process excludes this danger. 

In fact, feedback can be considered another instance of the 
sharing problem and loop is its solution. For instance if we want 
to compute a comb filter for input signal x and output signal y 
then the most elegant solution in Haskell is to represent x and 
y by lists and write the equation let y = x + delay y in 
y which can be solved lazily by the Haskell runtime system. In 
contrast to that if x and y are signal generators this would mean to 
produce infinitely large code since it holds 

y = x + delay y 

= x + delay (x + delay y) 

= x + delay (x + delay (x + delay y) ) ... 

With loop however we can share the output signal y with its oc- 
currences on the right hand side. Therefore, the code would be 

y = apply (arr mixFanout >>> second delay) x 
where mixFanout (a,b) = (a+b,a+b) 

Since the use of arrow combinators is somehow less intuitive 
than regular function application and Haskell's recursive let 
syntax, there is a preprocessor that translates a special arrow syn- 
tax into the above combinators. Further on there is a nice abstrac- 
tion of causal processes, namely commutative causal arrows 1 16 1. 

We like to note that we can even express signal processes that 
are causal with respect to one input and non-causal with respect 
to another one. E.g. frequency modulation is causal with respect 
to the frequency control but non-causal with respect to the input 
signal. This can be expressed by the type 

freqMod : : Generator (Value a) -> 

Causal (Value a) (Value a) 



In retrospect, our causal process data type looks very much 
like the signal generator type. It just adds a parameter to the tran- 
sition function. Vice versa the signal generator data type could 
be replaced by a causal process with no input channel. We could 
express this by 

type Generator a = Causal () a 

where ( ) is a miliary tuple. However for clarity reasons we keep 
Generator and Causal apart. 

3.3. Internal parameters 

It is a common problem in signal processing that recursive filters 
1171 are cheap in execution but computation of their internal pa- 
rameters (mainly feedback coefficients) is expensive. A popular 
solution to this problem is to compute the filter parameters at a 
lower sampling rate |18, 19|. Usually, the filter implementations 
hide the existence of internal parameters and thus they have to cope 
with the different sampling rates themselves. 

In this project we choose a more modular way. We make the 
filter parameters explicit but opaque and split the filtering process 
into generation of filter parameters, filter parameter resampling 
and actual filtering. Static typing asserts that filter parameters can 
only be used with the respective filters. 

This approach has several advantages: 

• A filter only has to treat inputs of the same sampling rate. 
We do not have to duplicate the code for coping with input 
at rates different from the sample rate. 

• We can provide different ways of specifying filter parame- 
ters, e.g. the resonance of a lowpass filter can be controlled 
either by the slope or by the amplification of the resonant 
frequency. 

• We can use different control rates in the same program. 

• We can even adapt the speed of filter parameter generation 
to the speed of changes in the control signal. 

• For a sinusoidal controlled filter sweep we can setup a table 
of filter parameters for logarithmically equally spaced cut- 
off frequencies and ship this table at varying rates according 
to arcus sine. 

• Classical handling of control rate filter parameter compu- 
tation can be considered as resampling of filter parameters 
with constant interpolation. If there is only a small num- 
ber of internal filter parameters then we can resample with 
linear interpolation of the filter parameters. 

The disadvantage of our approach is that we cannot write 
something simple like lowpass (sine controlRate) 
(input sampleRate) anymore, but with Haskell's type 
class mechanism we let the Haskell compiler choose the right 
filter for a filter parameter type and thus come close to the above 
concise expression. 

3.4. Vectorisation 

Modern processors have vector units like AltiVec in PowerPC, 
SSE and MMX in X86 processors and Neon in ARM. These vector 
units are capable of performing an operation on multiple numbers 
at once. E.g. a processor equipped with the SSE1 extension can 
perform 4 single precision floating point multiplications with one 
instruction. In contrast to pure single-instruction-multiple-data 
(SIMD) architectures like todays GPUs these vector units also sup- 
port rearrangement of the vector components. Fortunately ILLVMl 
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supports vector units through a uniform interface and moreover it 
allows us to directly call processor specific instructions. This way 
we have implemented various optimisations for SSE. 
We have checked the use of vector units in four ways: 

• for parallel processing like filter banks, processing of stereo 
signals or in multi-oscillators generating a chorus effect, a 
mixture of harmonics or a chord, 

• for serial processing by dividing a signal into chunks of the 
length of the native vector size, 

• for pipeline processing like in an allpass cascade or an im- 
plementation of BUTTERWORTH or CHEBYSHEV filters by 
a cascade of second order filters 1171 . 

• for internal repetitive operations like dot products in filters. 

We prefer to choose one way for all involved processes in order to 
avoid expensive rearrangement of the data. Our experiments show 
that the possibility for pipelining is rare and moreover pipelining 
introduces a delay. This is especially a problem for our model of 
causal processes. The optimisation of internal operations is mostly 
restricted to filters. Parallel vectorisation is possible only in cases 
where we do the same operations in parallel and the maximum 
possible speedup can only be achieved if the number of parallel 
channels is a multiple of the native vector size. Serial vectorisation 
is almost always possible due to the nature of most basic signal 
processes. However it requires that the user accepts a reduction of 
the time resolution in cutting operations by the factor of the size 
of a native vector. It turns out that even recursive processes can be 
vectorised but we may reduce the number of computations from n 
to log 2 n only. 

Coincidentally the loops for scalar computation are often the 
same as the ones for parallel vectorisation and serial vectorisation. 
Only the parameters are different. The serially vectorised counter- 
part of an oscillator with initial phase p and frequency / for vectors 
of size 4 is an oscillator with initial phases (p, frac(p+/), frac(p+ 
2/) , frac (p + 3/) ) and frequencies (4/, 4/, 4/, 4/) . In both cases 
the computation of the next phases consists of an addition and the 
projection into the interval [0, 1). Thus serial vectorisation is just 
parallel vectorisation of processes that run at the same lower rate 
but at interleaved phases. We like to refer to Figure [4] again that 
shows real assembly code for a serial vectorisation. 

Many generators (linear ramps, exponential curves, polyno- 
mial curves implemented by difference schemes or in a direct way, 
noise by linear congruences) and stateless causal processes (mix- 
ing, ring modulation, convolution ("non-recursive filters"), phase 
modulation in oscillators, mapping from oscillator phase to wave 
value, distortion) can be vectorised in this style. 

The vectorisation of stateful causal processes is different, if it 
is possible at all. Consider an oscillator with a frequency modu- 
lated at sample rate. Computing the phases means computing the 
cumulative sum followed by a parallel computation of the fraction 
of all interim sums. E.g. the cumulative sum d of an 8-element 
vector a can be computed by 

b = a + a>l (1) 
c = 6 + 6 > 2 (2) 
d = c + c»4 (3) 

where x 3> n denotes shifting the vector x by n components up- 
wards, filling the least components with zeros. The same way we 



can write a first order filter with feedback factor k. 

6 = a + fc-a>l (4) 
c = 6 + k 2 ■ 6 » 2 (5) 
d = c + fc 4 • c » 4 (6) 

We can express this by the z-transformation of that filter: 

1 _ l + fc-z' 1 _ (l + fc-z' 1 ) ■ (1 + fc 2 ■ z- 2 ) 
1 - k ■ z- 1 ~ 1 - k 2 ■ z- 2 ~ l-k 4 -z~ 4 

Generally by extending a polynomial fraction with the alternating 
polynomial of the denominator we eliminate all monomials with 
odd exponent in the denominator. This way we can decompose a 
purely recursive filter of any order into a short-term non-recursive 
filter and a long-term recursive filter. For a second order filter we 
obtain 

1 _ 1 + a ■ z' 1 + b ■ z~ 2 

I -a- z- 1 +b- z- 2 ~ I- {a 2 -2-b) ■ z- 2 + b 2 ■ z- 4 

and we repeat that extension until the non-recursive filter mask 
reaches the native vector size, that is usually a power of two. In a 
real application we must cope with varying filter parameters. To 
this end we had to adjust the general principle in order to get the 
same results for scalar and vector implementations that are con- 
trolled at "vector rate" (sample rate divided by native vector size). 

3.5. Parameters at Runtime 

So far we have only considered signal generators with parameters 
that are hard-wired into the machine loop. This means however 
that when rendering a song we need to recompile the signal gener- 
ator for every tone according to its pitch, its velocity and the gate 
signal. In order to overcome this we let the user define a record 
type p that contains all parameters he wishes to control. The gen- 
erator type becomes Generator p a and e.g. the type of an 
exponential curve generator becomes 

exponential : : 

(p -> Float) -> (p -> Float) -> 
Generator p (Value Float) 

where the parameters are functions that get a value from the record 
p (record field selectors). The function 

compile : : 

Generator p (Value a) -> 
10 (p -> StorableVector a) 

compiles the generator via lLLVMl and returns a function that de- 
pends on the parameter set of type p. In our implementation we 
distinguish between constant parameters and open parameters and 
hard-wire constant parameters into the machine loop. 

4. RELATED WORK 

Our goal is to make use of the elegance of Haskell program- 
ming for signal processing. Our work is driven by the experience 
that today compiled Haskell code cannot compete with tradi- 
tional signal processing packages written in C. There has been a 
lot of progress in recent years, most notably the improved support 
for arrays without overhead, the elimination of temporary arrays 
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(fusion) and the Data-Parallel Haskell project 1 20] that aims at util- 
ising multiple cores of modern processors for array oriented data 
processing. However there is still a considerable gap in perfor- 
mance between idiomatic Haskell code and idiomatic C code. A 
recent development is an ILLVMl -backend for the Glasgow Haskell 
Compiler (GHC). that adds all of the low-level optimisations of 
ILLVMI to lGHCI However we still need some tuning of the high- 
level optimisation and a support for processor vector types in order 
to catch up with our lEDSLl method. 

In Section [2] we gave some general thoughts about possible 
designs of signal processing languages. Actually for many com- 
binations of features we find instances: The two well-established 
packages CSound 1 18 1 and SuperCollider 1191 are domain specific 
untyped languages that process data in a chunky manner. This 
implies that they have no problem with sharing signals between 
signal processors but they support feedback with short delay only 
by small buffers (slow) or by custom plugins (more development 
effort). Both packages support three rates: note rate, control rate 
and sample rate in order to reduce expensive computations of in- 
ternal (filter) parameters. With the Haskell wrappers (7] [8) it 
is already possible to control these programs as if they were part 
of Haskell, but it is not possible to exchange audio streams with 
them in real-time. This shortcoming is resolved with our approach. 

Another special purpose language is ChucK 1211 . Distinguish- 
ing features of ChucK are the generalisation to many different rates 
and the possibility of programming while the program is running, 
that is while the sound is playing. As explained in Section [331 
we can already cope with control signals at different rates, how- 
ever the management of sample rates at all could be better if it 
was integrated in our framework for physical dimensions. Since 
the Haskell systems Hugs and IGHCl both have a fine interac- 
tive mode, Haskell can in principle also be used for live coding. 
However it still requires better support by LLVM (shared libraries) 
and by our implementation. 

Efficient short-delay feedback written in a declarative manner 
can probably only be achieved by compiling signal processes to 
a machine loop. This is the approach implemented by the Struc- 
tured Audio Orchestra Language of MPEG-4 |22| and Faust 1231 . 
Faust started as compiler to the C++ programming language, but it 
does now also support LLVM. Its block diagram model very much 
resembles Haskell's arrows (Section [3.2b . A difference is that 
Faust's combinators contain more automatisms which on the one 
hand simplifies binding of signal processors and on the other hand 
means that errors in connections cannot be spotted locally. 

Before our project the compiling approach embedded in a gen- 
eral purpose language was chosen by Common Lisp Music |24|, 
Lua-AV (25), and Feldspar (Haskell) (26). 

Of all listed languages only ChucK and Haskell are strongly 
and statically typed, and thus provide an extra layer of safety. We 
like to count Faust as being weakly typed since it provides only 
one integer and one floating point type. 

5. BENCHMARKS 

We like to put the performance of our implementation in the con- 
text of existing signal processing packages. To this end we com- 
pare it with the well-established packages CSound 5.10 and Su- 
perCollider 3 on an X86 machine with SSSE3 in Table [T] The 
code for the examples can be found in the dafx2010 directory of 
our repository 1121 . SuperCollider is designed as real-time server, 
but we run it in non-real-time mode. In order to increase relative 





CSound 


SuperCollider 


scalar 


vector 


saw 


0.24 


0.09 


0.08 


0.02 


chord 


0.54 


0.23 


0.13 


0.04 


chordchorus 


1.54 


0.77 


0.49 


0.13 


ping 


0.25 


0.11 


0.10 


0.03 


butterworth 


1.22 


0.45 


0.74 


0.60 


allpass 


1.54 


0.58 


1.07 


0.84 


karplus 


1.64 


0.34 


0.08 


0.06 



Table 1 : Benchmarks for computing 8820000 (200 seconds at 
44100 Hz) single precision floating point samples. Computing 
times are given in seconds. 



time measuring precision we chose a large number of generated 
samples, namely 200 • 44100 that is about 10 million samples. We 
generate single precision float samples since this is the native for- 
mat of both packages. We measure the "user time" with UNIX's 
time, thus the time for writing to the disk is ignored. 

The columns "scalar" and "vector" refer to our implementa- 
tion with scalar and vector operations up to SSSE3, respectively. 
As far as we know both CSound and SuperCollider do not use the 
X86 vector unit in the tested versions. Compilation and optimisa- 
tion by LLVM-2.6 is part of the execution time for our implemen- 
tation, but it is really not noticeable. 

"Saw" denotes a sawtooth oscillator with constant frequency 
and no anti-aliasing. That is, for SuperCollider we use If Saw 
instead of the band-limited saw. It is certainly the most simple 
waveform we can generate. In CSound the particular waveform 
does not matter since the oscillator reads it from a table. "Ping" 
is the same sawtooth enveloped in an exponential curve and shall 
check whether it helps to fuse different signal generators into one 
loop. We have chosen the half-life large enough in order to not 
run into denormalised numbers. Their handling is expensive, but 
it may also be avoided by the round-denormalised-to-zero mode. 
"Chord" is a mixture of four sawtooth tones at different frequen- 
cies and shall demonstrate whether parallel vectorisation pays off. 
The number of tones is small enough to hold all loop variables 
in SSE registers. "Chordchorus" is a mixture of four chorus os- 
cillators, each consisting of four plain sawtooth oscillators, that 
is a total of 16 oscillators. Here the number of SSE registers is 
exceeded. "Butterworth" is a filter sweep of a 10th order BUT- 
TERWORTH lowpass filter applied to white noise where the control 
rate is a 100th of the sample rate. Since SuperCollider does not 
provide a higher order BUTTERWORTH filter we use a cascade of 
5 second order BUTTERWORTH filters (lpf ). The result is dif- 
ferent but the speed should be comparable. This example shall 
demonstrate that our vectorised implementation of second order 
filters actually gives a little increase in performance. The noise 
however is vectorised, too. "Allpass" is a phaser implemented as 
causal process using an allpass cascade. It is applied to the ex- 
pensive "Butterworth" generator in order to show the savings of 
sharing the original and the allpass filtered signal. However, for 
SuperCollider we have chosen a simple delay line because of the 
lack of a first order allpass. "Karplus-Strong" is a variant of the 
according algorithm using a feedback with delay of 100 samples 
and a first order lowpass. 

6. CONCLUSIONS AND FURTHER WORK 

The speed numbers of our implementation are excellent, yet the 
generating Haskell code looks idiomatic. The next step is the 
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integration of the current low-level implementation into our exist- 
ing framework for signal processing that works with real physical 
quantities and statically checked physical dimensions. There is 
also a lot of room for automated optimisations bv lGHCI rules. be 
it for vectorisation or for reduction of redundant computations of 
frac. 

We hope that ILLVMI supports GPUs in the future and thus 
makes them accessible by our framework. However this will cer- 
tainly need a concerted effort for the lLLVMl developers and it also 
requires an adapted design of our vectorisation, since GPUs have 
no notion of a vector and thus cannot shuffle vector elements. 
Actually ILLVMI can already generate C code, but we have not 
checked whether this is compatible with CUDA. 
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